Working with DataFrames
Both SimpleSDMLayers.jl and GBIF.jl offer an optional integration with the DataFrames.jl package. Therefore, our previous example with the kingfisher Megaceryle alcyon could also be approached with a DataFrame-centered workflow.
We will illustrate this using the same data and producing the same figures as in the previous example. To do so, we will use GBIF.jl to produce the occurrence DataFrame we will use throughout this example. However, it is also possible to use a DataFrame of your choosing instead of one generated by GBIF.jl, as long as it holds one occurrence per row, a column with the latitude coordinates, and a column with longitude coordinates. For the rest, it can hold whatever information you like. Most of our functions assume by default that the coordinates are stored in columns named :latitude and :longitude (the order doesn't matter), but you can generally specify other names with latitude = :lat in case you don't want to rename them (we will show you how below).
So let's start by getting our data:
# Load packages
using SimpleSDMLayers
using GBIF
using Plots
using Statistics
# Load DataFrames too
using DataFrames
# Load environmental data
temperature, precipitation = worldclim([1,12])
# Get GBIF occurrences
kingfisher = GBIF.taxon("Megaceryle alcyon", strict=true)
kf_occurrences = occurrences(kingfisher,
"hasCoordinate" => "true",
"decimalLatitude" => (0.0, 65.0),
"decimalLongitude" => (-180.0, -50.0),
"limit" => 200)
for i in 1:4
occurrences!(kf_occurrences)
end
@info kf_occurrences[ Info: DataFrames integration loaded [ Info: Loading DataFrames support for GBIF.jl [ Info: GBIF records: downloaded 1000 out of 100000
Once the data is loaded, we can easily convert the environmental layers to a DataFrame with the corresponding coordinates. We can do this for a single layer:
temperature_df = DataFrame(temperature)
first(temperature_df, 5)| latitude | longitude | values | |
|---|---|---|---|
| Float64 | Float64 | Union… | |
| 1 | -89.9167 | -179.917 | -31.0171 |
| 2 | -89.75 | -179.917 | -30.3919 |
| 3 | -89.5833 | -179.917 | -33.4822 |
| 4 | -89.4167 | -179.917 | -33.6104 |
| 5 | -89.25 | -179.917 | -33.7199 |
Or for multiple layers at the same time:
env_layers = [temperature, precipitation]
env_df = DataFrame(env_layers)
rename!(env_df, :x1 => :temperature, :x2 => :precipitation)
first(env_df, 5)| longitude | latitude | temperature | precipitation | |
|---|---|---|---|---|
| Float64 | Float64 | Union… | Union… | |
| 1 | -179.917 | -89.9167 | -31.0171 | 43.0 |
| 2 | -179.917 | -89.75 | -30.3919 | 40.0 |
| 3 | -179.917 | -89.5833 | -33.4822 | 40.0 |
| 4 | -179.917 | -89.4167 | -33.6104 | 37.0 |
| 5 | -179.917 | -89.25 | -33.7199 | 40.0 |
GBIF.jl allows us to convert a set of occurrences to a DataFrame just as easily:
kf_df = DataFrame(kf_occurrences)
last(kf_df, 5)| key | name | dataset | published_in | observed_in | latitude | longitude | date | rank | observer | license | kingdom | phylum | class | order | family | genus | species | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Abstrac…? | Abstrac…? | Abstrac…? | Abstrac…? | Abstrac…? | Abstrac…? | DateTim…? | Abstrac…? | Abstrac…? | Abstrac…? | String? | String? | String? | String? | String? | String? | String? | |
| 1 | 2573871865 | Megaceryle alcyon | iNaturalist research-grade observations | US | MX | 23.113 | -106.28 | 2020-02-11T14:05:00 | SPECIES | FRANCISCO MIGUEL FARRIOLS ESTRADA | http://creativecommons.org/licenses/by-nc/4.0/legalcode | Animalia | Chordata | Aves | Coraciiformes | Alcedinidae | Megaceryle | Megaceryle alcyon |
| 2 | 2573872043 | Megaceryle alcyon | iNaturalist research-grade observations | US | US | 29.3455 | -98.4504 | 2020-02-14T09:58:00 | SPECIES | Peter Joseph | http://creativecommons.org/licenses/by-nc/4.0/legalcode | Animalia | Chordata | Aves | Coraciiformes | Alcedinidae | Megaceryle | Megaceryle alcyon |
| 3 | 2573873683 | Megaceryle alcyon | iNaturalist research-grade observations | US | US | 31.3791 | -94.7267 | 2020-02-10T00:00:00 | SPECIES | tnewman | http://creativecommons.org/licenses/by-nc/4.0/legalcode | Animalia | Chordata | Aves | Coraciiformes | Alcedinidae | Megaceryle | Megaceryle alcyon |
| 4 | 2573873789 | Megaceryle alcyon | iNaturalist research-grade observations | US | US | 26.229 | -97.3473 | 2020-02-11T12:42:57 | SPECIES | ola10birder | http://creativecommons.org/licenses/by-nc/4.0/legalcode | Animalia | Chordata | Aves | Coraciiformes | Alcedinidae | Megaceryle | Megaceryle alcyon |
| 5 | 2573874984 | Megaceryle alcyon | iNaturalist research-grade observations | US | US | 31.0323 | -97.4213 | 2020-02-23T16:20:00 | SPECIES | A. Flinn (and M. Finn) | http://creativecommons.org/licenses/by-nc/4.0/legalcode | Animalia | Chordata | Aves | Coraciiformes | Alcedinidae | Megaceryle | Megaceryle alcyon |
We can then extract the temperature values for all the occurrences.
temperature[kf_df]1000-element Array{Float32,1}:
18.931166
18.931166
17.857887
17.08904
9.419209
12.142196
10.534634
12.142196
24.394215
14.874448
⋮
18.901596
10.245042
23.56406
9.809229
25.076925
20.466646
18.83451
22.63053
19.26551Or we can clip the layers according to the occurrences:
temperature_clip = clip(temperature, kf_df)
precipitation_clip = clip(precipitation, kf_df)SDM predictor → 336×738 grid with 81323 Float32-valued cells Latitudes (5.916666666666657, 61.750000000000014) Longitudes (-172.58333333333331, -49.749999999999986)
In case your DataFrame has different column names for the coordinates, for example :lat and :lon, you can clip it like this:
kf_df_shortnames = rename(kf_df, :latitude => :lat, :longitude => :lon)
clip(temperature, kf_df_shortnames; latitude = :lat, longitude = :lon)SDM predictor → 336×738 grid with 81323 Float32-valued cells Latitudes (5.916666666666657, 61.750000000000014) Longitudes (-172.58333333333331, -49.749999999999986)
We can finally plot the layer and occurrence values in a similar way to any DataFrame or Array. Since there are often many nothing values in the layers, it might be necessary to use filter! first:
filter!(x -> !isnothing(x.temperature) && !isnothing(x.precipitation), env_df);
histogram2d(env_df.temperature, env_df.precipitation, c = :viridis)
scatter!(temperature_clip[kf_df], precipitation_clip[kf_df],
lab= "", c = :white, msc = :orange)
To plot the occurrence values over space, you can use:
contour(temperature_clip, c = :alpine, title = "Temperature",
frame = :box, fill = true)
scatter!(kf_df.longitude, kf_df.latitude,
lab = "", c = :white, msc = :orange, ms = 2)
We can finally make a layer with the number of observations per cells:
abundance = mask(precipitation_clip, kf_occurrences, Float32)SDM response → 336×738 grid with 81323 Float32-valued cells Latitudes (5.916666666666657, 61.750000000000014) Longitudes (-172.58333333333331, -49.749999999999986)
A useful trick to visualize sites with occurrences, in contrast with sites without any occurrence, is to use replace or replace! to set the values returned as 0 or true by the function mask() to nothing. This allows us to first plot a background layer with a uniform colour, covering the whole area to visualize, then plot the occurrence layer on top using a different colour scale.
abundance_nozeros = replace(abundance, 0 => nothing)
plot(precipitation_clip, c = :lightgrey)
plot!(abundance_nozeros, c = :viridis, clim = extrema(abundance_nozeros))
Once again, the cells are rather small, and there are few observations, so this is not necessarily going to be very informative. As in our other example, to get a better sense of the distribution of observations, we can get the average number of observations in a radius of 100km around each cell (we will do so for a zoomed-in part of the map to save time):
zoom = abundance[left = -100.0, right = -75.0, top = 43.0, bottom = 20.0]
buffered = slidingwindow(zoom, Statistics.mean, 100.0)
plot(buffered, c = :lapaz, legend = false, frame = :box)
scatter!(kf_df.longitude, kf_df.latitude,
lab = "", c = :white, msc = :orange, ms = 2, alpha = 0.5)